

Algorithm 2 Discrete backpropagation via projection

Input: The training dataset; the full-precision kernels $C$; the projection matrix $W$; the learning rates $\eta_1$ and $\eta_2$.
Output: The binary or ternary PCNNs based on the updated $C$ and $W$.

1: Initialize $C$ and $W$ randomly;
2: repeat
3:    // Forward propagation
4:    for $l = 1$ to $L$ do
5:        $\hat{C}^l_{i,j} \leftarrow P(W, C^l_i)$; // using Eq. 3.43 (binary) or Eq. 3.59 (ternary)
6:        $D^l_i \leftarrow \mathrm{Concatenate}(\hat{C}_{i,j})$; // using Eq. 3.45
7:        Perform activation binarization; // using the sign function
8:        Perform traditional 2D convolution; // using Eqs. 3.46, 3.47, and 3.48
9:    end for
10:   Calculate the cross-entropy loss $L_S$;
11:   // Backward propagation
12:   Compute $\delta_{\hat{C}^l_{i,j}} = \partial L_S / \partial \hat{C}^l_{i,j}$;
13:   for $l = L$ to $1$ do
14:       // Calculate the gradients
15:       Calculate $\delta_{C^l_i}$; // using Eqs. 3.49, 3.51, and 3.52
16:       Calculate $\delta_{W^l_j}$; // using Eqs. 3.115, 3.116, and 3.56
17:       // Update the parameters
18:       $C^l_i \leftarrow C^l_i - \eta_1 \delta_{C^l_i}$; // using Eq. 3.50
19:       $W^l_j \leftarrow W^l_j - \eta_2 \delta_{W^l_j}$; // using Eq. 3.54
20:   end for
21:   Adjust the learning rates $\eta_1$ and $\eta_2$;
22: until the network converges
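To make Algorithm 2 concrete, the following is a minimal PyTorch-style sketch of one projection layer and the training loop. It assumes a single per-kernel projection scale $W$, a sign-based binary projection, and a straight-through estimator for the discrete step; the names (SignSTE, ProjectionConv, the two learning-rate groups standing in for $\eta_1$ and $\eta_2$) are illustrative and do not reproduce the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    # sign() in the forward pass, straight-through (clipped) gradient in the backward pass.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # pass gradients only where |x| <= 1

class ProjectionConv(nn.Module):
    # One PCNN-style layer: project the full-precision kernels C with W,
    # binarize the activations, then run an ordinary 2D convolution.
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.C = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)  # full-precision kernels
        self.W = nn.Parameter(torch.ones(out_ch, 1, 1, 1))  # per-kernel projection scale (assumed form)

    def forward(self, x):
        c_hat = SignSTE.apply(self.W * self.C)  # discrete projected kernels (binary case)
        x_bin = SignSTE.apply(x)                # activation binarization with the sign function
        return F.conv2d(x_bin, c_hat, padding=1)

layer = ProjectionConv(3, 16)
head = nn.Linear(16, 10)
opt = torch.optim.SGD(
    [{"params": [layer.C], "lr": 1e-2},   # eta1 for the kernels C
     {"params": [layer.W], "lr": 1e-3},   # eta2 for the projection parameters W
     {"params": head.parameters()}],
    lr=1e-2,
)

for step in range(3):                      # "repeat ... until the network converges"
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))
    feat = layer(x).mean(dim=(2, 3))       # forward propagation
    loss = F.cross_entropy(head(feat), y)  # cross-entropy loss L_S
    opt.zero_grad()
    loss.backward()                        # backward propagation through the projection
    opt.step()                             # update C and W with their own learning rates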

We believe that compressed ternary CNNs such as TTN [299] and TWN [130] provide better initialization states for binary CNNs. Theoretically, the performance of models with ternary weights is slightly better than that of models with binary weights but far worse than that of real-valued models. Still, they provide an excellent initialization state for 1-bit CNNs in our proposed progressive optimization framework. Subsequent experiments show that our PCNNs trained with the progressive optimization strategy perform better than those trained from scratch, and even better than the ternary PCNNs trained from scratch.

The discrete set for ternary weights is a special case, defined as $\Omega := \{a_1, a_2, a_3\}$. To be hardware friendly [130], we further require $a_1 = -a_3 = \Delta$, as in Eq. 3.57, and $a_2 = 0$.

Regarding the threshold for ternary weights, we follow the choice made in [229] as

$$\Delta^l = \sigma \times E\left(|C^l|\right) \approx \frac{\sigma}{I} \sum_{i} \left\| C^l_i \right\|_1, \qquad (3.58)$$

where $\sigma$ is a constant factor for all layers. Note that [229] applies Eq. 3.58 to convolutional inputs or feature maps; we find it appropriate for convolutional weights as well. Consequently,

we redefine the projection in Eq. 3.29 as

$$P_\Omega(\omega, x) = \arg\min_{a_i} \left\| \omega x - a_i \right\|_2, \quad i \in \{1, \ldots, U\}. \qquad (3.59)$$
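As a concrete illustration of Eqs. 3.58 and 3.59, the short sketch below (assuming PyTorch; the value $\sigma = 0.7$ and the helper names are illustrative assumptions rather than values or code from the text) computes a layer-wise threshold and projects full-precision kernels onto $\Omega = \{-\Delta, 0, +\Delta\}$ by picking the nearest element.

import torch

def ternary_threshold(C, sigma=0.7):
    # Delta^l = sigma * E(|C^l|) for one layer's kernels C (cf. Eq. 3.58);
    # sigma = 0.7 is an assumed constant factor, not one prescribed by the text.
    return sigma * C.abs().mean()

def project_onto_set(v, levels):
    # Element-wise nearest-point projection, arg min over a_i of |v - a_i| (cf. Eq. 3.59).
    levels = torch.as_tensor(levels, dtype=v.dtype)
    dist = (v.unsqueeze(-1) - levels).abs()   # distance of every entry to every a_i
    return levels[dist.argmin(dim=-1)]

C = torch.randn(16, 3, 3, 3)                  # full-precision kernels of one layer
delta = float(ternary_threshold(C))
C_ternary = project_onto_set(C, [-delta, 0.0, delta])  # entries now lie in {-Delta, 0, +Delta}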

In our proposed progressive optimization framework, the PCNNs with ternary weights

(ternary PCNNs) are first trained from scratch and then serve as pre-trained models to

progressively fine-tune the PCNNs with binary weights (binary PCNNs).
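A minimal sketch of this hand-off, under the assumption that the ternary and binary variants share the same parameter layout (the TinyPCNN class and its mode flag are placeholders, not the actual PCNN code):

import torch
import torch.nn as nn

class TinyPCNN(nn.Module):
    # Placeholder module: only the shared parameters C and W matter for the hand-off.
    def __init__(self, mode="ternary"):
        super().__init__()
        self.mode = mode                                       # ternary or binary projection
        self.C = nn.Parameter(torch.randn(16, 3, 3, 3) * 0.1)  # full-precision kernels
        self.W = nn.Parameter(torch.ones(16, 1, 1, 1))         # projection parameters

ternary_model = TinyPCNN(mode="ternary")
# ... train ternary_model from scratch (omitted) ...

binary_model = TinyPCNN(mode="binary")
binary_model.load_state_dict(ternary_model.state_dict())  # the ternary C and W initialize the binary PCNN
# ... progressively fine-tune binary_model, typically with smaller learning rates (omitted) ...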